Merge Path – Cache-Efficient Parallel Merge and Sort
Authors
Abstract
Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and minimizing inter-thread synchronization requirements. Due to the extremely low compute-to-memory-access ratio, it is also critically important to utilize the memory system efficiently: minimize memory traffic, maximize the cache hit rate, and minimize cache-coherence-related activity. We present a novel approach to partitioning the two sorted arrays into pairs of contiguous sequences of elements, one from each array, such that 1) each pair comprises any desired total number of elements, and 2) the elements of each pair form a contiguous sequence in the final merged sorted array. While the resulting partition and the computational complexity are similar to those of certain previous algorithms, our approach is different, extremely intuitive, and offers interesting insights. Based on this, we present a synchronization-free, cache-efficient merging (and sorting) algorithm. While we use CREW PRAM as the basis, our algorithm is easily adaptable to additional architectures. In fact, our approach is even relevant to sequential cache-efficient sorting. The new algorithm has been implemented both on the HyperCore many-core shared-cache architecture and on a sizable x86 system, with emphasis on cache efficiency. The algorithms and performance results are presented, along with important cache-related insights.
Keywords: Cache Memories; Parallelism and concurrency; Parallel processors; Sorting and searching
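The partitioning property described in the abstract can be illustrated with a short sketch. The abstract does not spell out how the split points are found, so the binary search used here is an assumption, and the function names (diagonal_split, merge_segment, merge_path_merge) are hypothetical rather than taken from the paper. For a desired prefix length k, the sketch finds how many of the first k merged elements come from each array; consecutive split points then delimit pairs of contiguous sub-arrays that can be merged independently.

```python
from typing import List, Sequence, Tuple


def diagonal_split(a: Sequence[int], b: Sequence[int], k: int) -> Tuple[int, int]:
    """Return (i, j) with i + j == k such that a[:i] and b[:j] together hold
    the first k elements of the stable merge of a and b (ties go to a)."""
    lo = max(0, k - len(b))  # fewest elements that must come from a
    hi = min(k, len(a))      # most elements that can come from a
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i
        # a[i] <= b[j - 1] means a[i] precedes b[j - 1] in the merge,
        # so the prefix must take more than i elements from a.
        if a[i] <= b[j - 1]:
            lo = i + 1
        else:
            hi = i
    return lo, k - lo


def merge_segment(a: Sequence[int], b: Sequence[int],
                  start: Tuple[int, int], end: Tuple[int, int]) -> List[int]:
    """Sequentially merge a[start[0]:end[0]] with b[start[1]:end[1]]."""
    i, j = start
    i_end, j_end = end
    out: List[int] = []
    while i < i_end and j < j_end:
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:i_end])
    out.extend(b[j:j_end])
    return out


def merge_path_merge(a: Sequence[int], b: Sequence[int], parts: int) -> List[int]:
    """Merge a and b by cutting the work into `parts` near-equal segments.
    The segments are processed in a plain loop here; each one reads and
    writes disjoint ranges, so they could be handed to separate workers."""
    total = len(a) + len(b)
    cuts = [diagonal_split(a, b, total * p // parts) for p in range(parts + 1)]
    merged: List[int] = []
    for start, end in zip(cuts, cuts[1:]):
        merged.extend(merge_segment(a, b, start, end))
    return merged
```

Because every segment covers the same number of output elements (up to rounding), the load is balanced regardless of how the values are distributed between the two inputs. The sketch runs the segments one after another and makes no claim about cache behaviour; it only demonstrates the two partition properties stated in the abstract.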
Similar resources
IRWIN AND JOAN JACOBS CENTER FOR COMMUNICATION AND INFORMATION TECHNOLOGIES Merge Path – Cache-Efficient Parallel Merge and Sort
Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and minimizing inter-thread synchronization requirements. Due to the extremely low compute-to-memory-access ratio, it is also critically important to efficiently utilize ...
Efficient Oblivious Parallel Sorting on the MasPar MP-1
We address the problem of sorting a large number N of keys on a MasPar MP-1 parallel SIMD machine of moderate size P where the processing elements (PEs) are interconnected as a toroidal mesh and have 16KB local storage each. We present a comparative study of implementations of the following deterministic oblivious sorting methods: Bitonic Sort, Odd-Even Merge Sort, and FastSort. We successfully...
Proficient Pair of Replacement Algorithms on L1 and L2 Cache for Merge Sort
A memory hierarchy is used to keep pace with processor speed. Cache memory is fast memory used to bridge the speed gap between main memory and the processor. The access patterns of the Level 1 (L1) and Level 2 (L2) caches differ: when the CPU does not find the desired data in L1, it accesses L2. Thus a replacement algorithm that works efficiently on L1 may not be as efficient on L2. Si...
Geometric Algorithms for Private-Cache Chip Multiprocessors
We study techniques for obtaining efficient algorithms for geometric problems on private-cache chip multiprocessors. We show how to obtain optimal algorithms for interval stabbing counting, 1-D range counting, weighted 2-D dominance counting, and for computing 3-D maxima, 2-D lower envelopes, and 2-D convex hulls. These results are obtained by analyzing adaptations of either the PEM merge sort ...
Sorting on a Massively Parallel System Using a Library of Basic Primitives: Modeling and Experimental Results
We present a comparative study of implementations of the following sorting algorithms on the Parsytec SC320 reconfigurable, asynchronous, massively parallel MIMD machine: Bitonic Sort, Odd-Even Merge Sort, Odd-Even Merge Sort with guarded split&merge, and two variants of Samplesort. The experiments are performed on 2- up to 5-dimensional wrapped butterfly networks with 8 up to 160 processors. We ...
Publication date: 2012